Each NSC gives rise to 6 progeny cells and the progeny cells have less mutant mtDNA than the NSCs they derive from. There is extensive progeny cell-to-progeny cell variability in heteroplasmy levels, even for those progeny cells derived from the same NSC.
Load and preprocess data
data_dir <-here("data")plot_dir <-here("figures")if (!dir.exists(plot_dir)) {dir.create(plot_dir, recursive =TRUE)}# Load excel datacell_volume_data <-read_excel(here(data_dir, "NSC-progeny cell volume.xlsx"))mtDNA_number_data <-read_excel(here(data_dir, "NSC-progeny FISH mtDNA number.xlsx"))# Clean column namescell_volume_data <- cell_volume_data %>%clean_names()mtDNA_number_data <- mtDNA_number_data %>%clean_names()cell_volume_data <- cell_volume_data %>%rename("cell_type"="x1")# Rename columns using janitormtDNA_number_data <- mtDNA_number_data %>%rename_with(~gsub("mt_dna_", "mtDNA_", .), starts_with("mt_dna"))# Colours for plotsNSC_color <-"#7384E9"progeny_color <-"#97DDEC"sim_color <-"#E64B35B2"# Labels for lineages in plotslineage_labels <-setNames(paste("Lineage", sort(unique(mtDNA_number_data$lineage))),sort(unique(mtDNA_number_data$lineage)))# Calculate heteroplasmy for each observationmtDNA_number_data <- mtDNA_number_data %>%mutate(heteroplasmy = (mtDNA_mut / mtDNA_total) *100)
Observed volumes and mtDNA copy number
There is an approximate reduction of 12-14x in cellular volume from NSCs to progeny cells.
Figure 1: Cellular, cytoplasmic, and nuclear volumes in NSCs and progeny cells. Boxplots show median, interquartile range, and individual data points. NSCs exhibit approximately 12-14× larger volumes than progeny cells across all compartments (Wilcoxon rank-sum test, p < 0.001 for all comparisons).
Figure 2: mtDNA copy number in NSCs and progeny cells. Each dot represents a single cell, colored by lineage. NSCs contain significantly more mtDNA molecules than progeny cells (Wilcoxon rank-sum test, p < 0.001), with approximately 12× reduction consistent with cellular volume changes.
The mtDNA copy number was significantly lower in progeny cells compared to NSCs (Wilcoxon rank-sum test, p = 2.5e-09).
Comparison of mtDNA copy number and cellular volume between NSCs and progeny cells
NSC (Mean ± SD)
Progeny (Mean ± SD)
Fold Reduction
mtDNA Copy Number
208.7 ± 68.4
16.4 ± 7.1
12.7×
Cellular Volume (μm³)
115.9 ± 38.8
9.5 ± 4.2
12.2×
There is an approximate 12-14x reduction in mtDNA copy number from NSCs to progeny cells, similar to the reduction in cellular volume.
Observed heteroplasmy levels
The variance in the heteroplasmy levels for progeny cells is higher compared to NSCs. The average heteroplasmy levels are lower in progeny cells compared to NSCs.
Figure 3: Heteroplasmy levels in NSCs and individual progeny cells for each of 14 lineages. Each NSC (left) gives rise to ~6 progeny cells (right). Progeny cells show increased variance and generally lower heteroplasmy compared to their parent NSC, with substantial cell-to-cell variability even among progeny from the same NSC.
Comparison of heteroplasmy levels between NSCs and progeny cells
NSC (Mean ± SD)
NSC (Median, IQR)
Progeny (Mean ± SD)
Progeny (Median, IQR)
Fold Change
P-value
Heteroplasmy (%)
66.4 ± 5.7
64.5 (IQR: 6.8)
48.6 ± 11.8
50.0 (IQR: 16.8)
1.37×
7.6e-07
The mean heteroplasmy level was significantly lower in progeny cells compared to NSCs (Wilcoxon rank-sum test, p = 7.6e-07).
Plotting the heteroplasmy levels for all NSCs and progeny cells across all lineages.
Figure 4: Heteroplasmy levels in NSCs and progeny cells across all lineages
Plotting the observed heteroplasmy in NSCs and progeny cells by Lineage
Figure 5: Heteroplasmy distributions in NSCs and progeny cells faceted by lineage. Each panel shows one NSC (left) and its ~6 progeny cells (right). Points are colored by lineage. The shift toward lower heteroplasmy and increased variance in progeny is evident across all lineages, though the magnitude varies
Mixed-effects models for heteroplasmy variances
To test whether the heteroplasmy variances differ between NSCs and progeny cells, we used linear mixed-effects models (Pinheiro and Bates, 2000) with cell type as a fixed effect and lineage as a random effect. We compared a null model assuming equal variances (Model 0) to an alternative model allowing for different variances in the two cell types (Model 1) using a likelihood ratio test. For indices \(i\) representing cell lineage, \(j\) representing individual observations, \(k\) representing cell type (NSC or Progeny) of observation \(j\) and heteroplasmy \(h_{ij}\), the models are defined as follows:
\(u_i \sim N(0, \sigma_{u}^{2})\)\(\to\) random intercept for lineage \(i\)
\(\epsilon_{ij} \sim N(0, \sigma_{k}^{2})\)\(\to\) residual error with variance \(\sigma_{k}^{2}\) specific to cell type \(k\)
Raw heteroplasmy data was logit transformed and squeezed to avoid issues with 0 and 100% values before fitting the models. The \(\delta\) value was set to \(\frac{1}{2N_{max}}\), where \(N_{max}\) is the largest observed mtDNA copy number, to avoid issues with zero or one heteroplasmy values while minimizing bias. \[logit(h) = ln (\frac{h}{1-h})\]
Models were fit with Maximum Likelihood (ML) for the likelihood ratio test and with Restricted Maximum Likelihood (REML) for parameter estimation.
Linear mixed-effects modeling of heteroplasmy variances
# Logit transform heteroplasmy to fit normality assumptionsmtDNA_logit_data <- mtDNA_number_data %>% dplyr::mutate(heteroplasmy_prop = heteroplasmy /100,# Calculate epsilon based on the largest observed mtDNA pool in the dataepsilon =1/ (2*max(mtDNA_total, na.rm =TRUE)),heteroplasmy_prop_squeezed =pmax(epsilon, pmin(1- epsilon, heteroplasmy_prop)),logit_heteroplasmy = car::logit(heteroplasmy_prop_squeezed) )# Model 0 (null model): Equal varianceslme_null_model <-lme( logit_heteroplasmy ~ cell_type,random =~1| lineage,data = mtDNA_logit_data,method ="ML")# Model 1 (alternative model): Unequal varianceslme_alt_model <-lme( logit_heteroplasmy ~ cell_type,random =~1| lineage,weights =varIdent(form =~1| cell_type),data = mtDNA_logit_data,method ="ML")# Likelihood ratio testlrt_result <-anova( lme_alt_model, lme_null_model)lrt_l_ratio <-round(lrt_result$L.Ratio[2], 2)lrt_p_value <-format( lrt_result$`p-value`[2], scientific =FALSE, digits =4)# Summary of the best-fit modellme_alt_summary <-summary(update(lme_alt_model, method ="REML"))# Extract coefficients and p-valuestTable <- lme_alt_summary$tTablebeta_progeny <- tTable["cell_typeProgeny", "Value" ]se_progeny <- tTable["cell_typeProgeny", "Std.Error" ]p_progeny <- tTable["cell_typeProgeny", "p-value" ]
Figure 6: Diagnostic plots for the linear mixed-effects model on logit-transformed heteroplasmy
Figure 7: Q-Q plot of residuals for the linear mixed-effects model on logit-transformed heteroplasmy
The residuals vs. fitted values plot shows no systematic pattern, indicating that the model assumptions of constant variance and linearity are met. The Q-Q plot demonstrates that the residuals are approximately normally distributed, supporting the validity of the linear mixed-effects model.
Including mtDNA copy number as a covariate did not significantly improve the model fit (LRT p = 0.1479), suggesting that the observed heteroplasmy differences are not driven by differences in mtDNA copy number.
Figure 8: Heteroplasmy variance in NSCs and progeny cells. Violin plots show kernel density estimates, with boxplots (median and IQR) and individual data points overlaid. Progeny cells exhibit significantly greater heteroplasmy variance than NSCs (LRT p < 0.001 for unequal variance model), despite having lower mean heteroplasmy. Each point represents a single cell
The linear mixed-effects model on the logit-transformed heteroplasmy data revealed that the odds ratio for progeny cells vs NSC was 0.47 (\(\beta\) = -0.751, SE = 0.069, p < 0.001), indicating that progeny cells had 53% lower odds of carrying mutant mtDNA compared to NSCs. Progeny cells also showed significantly greater heteroplasmy variability than NSCs, as determined by a likelihood ratio test (LRT) comparing the model allowing for different variances to a null model assuming equal variances (LRT = 11.39, p = 0.000738).
Simulations of mtDNA segregation from NSCs to progeny cells
To test whether the observed heteroplasmy variance and shifts in progeny cells could arise purely from stochastic mtDNA partitioning, we simulated heteroplasmy under a model of random drift.
The simulation strategy was as follows:
Simulations were performed independently for each observed cell lineage.
For each lineage, a virtual parent NSC was created containing the same number of wild-type and mutant mtDNA molecules as observed experimentally.
For each progeny cell, a corresponding total mtDNA copy number was assigned based on the experimentally observed values.
A simulated sample of mtDNA molecules was drawn from the parent pool using hypergeometric sampling, modeling partitioning without replacement so that molecules assigned to one progeny cell were removed from the pool and unavailable for subsequent draws.
The heteroplasmy level for each simulated progeny cell was then calculated as the ratio of mutant molecules to the sampled mtDNA total.
The simulation of the entire lineage was repeated 10,000 times to generate a null distribution of heteroplasmy levels for each progeny cell.
Simulation functions for modeling random drift
simulate_lineage_hypergeometric <-function( nsc_wt, nsc_mut, observed_totals, n_draws =999) { n_progeny <-length(observed_totals) simulated_draws <-replicate(n_draws, {# Initialize the pool for this single simulation run remaining_wt <- nsc_wt remaining_mut <- nsc_mut# Create empty vectors to store the results for each progeny cell wt_samples <-numeric(n_progeny) mut_samples <-numeric(n_progeny)# Loop through each progeny cell, drawing from and then shrinking the poolfor (i inseq_len(n_progeny)) {# The total mtDNA available for this draw current_total_pool <- remaining_wt + remaining_mut# If the current pool is empty, we cannot draw any mtDNAif (current_total_pool ==0) { wt_samples[i] <-0 mut_samples[i] <-0next }# The number of molecules to sample for this progeny cell k_draw <-min(observed_totals[i], current_total_pool)# Hypergeometric draw without replacement from the current pool wt_samples[i] <-rhyper(1,m = remaining_wt, # Number of wild-type mtDNA availablen = remaining_mut, # Number of mutant mtDNA availablek = k_draw # Number of molecules to sample )# Calculate the number of mutant mtDNA drawn mut_samples[i] <- k_draw - wt_samples[i]# Update the remaining pools for the next draw remaining_wt <- remaining_wt - wt_samples[i] remaining_mut <- remaining_mut - mut_samples[i] }# Calculate heteroplasmy for each progeny cell in this draw heteroplasmy_values <- (mut_samples / observed_totals) *100# Handle cases where observed totals in 0 to avoid Nan heteroplasmy_values[is.nan(heteroplasmy_values)] <-0# Return a data frame with simulated values for this single drawdata.frame(mtDNA_total = observed_totals,mtDNA_wt = wt_samples,mtDNA_mut = mut_samples,heteroplasmy = heteroplasmy_values ) },simplify =FALSE )return(simulated_draws)}# Simulate mtDNA distribution in progeny cells for all lineagessimulate_all_lineages <-function(mtDNA_data, n_draws =999) {# Split data by lineage lineage_groups <- mtDNA_data %>% dplyr::group_by(lineage) %>% dplyr::group_split() all_simulations <-list()for (lineage_data in lineage_groups) {# Extract NSC data for this lineage nsc_data <- lineage_data %>%filter(cell_type =="NSC")# Extract progeny data for this lineage progeny_data <- lineage_data %>%filter(cell_type =="Progeny")# Extract mtDNA_wt and mtDNA_mut for NSC nsc_wt <- nsc_data$mtDNA_wt nsc_mut <- nsc_data$mtDNA_mut# Get observed mtDNA_total for progeny cells observed_progeny_totals <- progeny_data$mtDNA_total# Simulate mtDNA distribution in progeny cells for this lineage lineage_simulations <-simulate_lineage_hypergeometric(nsc_wt = nsc_wt,nsc_mut = nsc_mut,observed_totals = observed_progeny_totals,n_draws = n_draws )# Store the simulations all_simulations[[as.character(unique(nsc_data$lineage))]] <- lineage_simulations }return(all_simulations)}
We tested the convergence of the simulation results by running the simulations with different sample sizes (1,000; 5,000; 10,000; and 20,000) and comparing the resulting heteroplasmy distributions and summary statistics.
Test simulation convergence across different sample sizes
Figure 9: Convergence of heteroplasmy distributions across different simulation sample sizes. Kernel density estimates of heteroplasmy levels from simulations with varying numbers of draws (1,000; 5,000; 10,000; and 20,000) are shown. The distributions overlap closely, indicating convergence of the simulation results as sample size increases.
Convergence of Simulation Statistics Across Sample Sizes
N Simulations
Mean Het (%)
SD Het (%)
Median Het (%)
Variance
25th %ile
75th %ile
N Unique
1,000
66.35
13.42
66.67
180.11
57.14
75
227
5,000
66.34
13.38
66.67
179.09
57.14
75
248
10,000
66.33
13.39
66.67
179.26
57.14
75
252
20,000
66.31
13.40
66.67
179.68
57.14
75
255
The summary statistics were stable across all the sample sizes tested, indicating that the simulations achieved convergence. We used a sample size of 10,000 for hypothesis testing.
Monte-Carlo test of normalized heteroplasmy variance
We compared the normalized heteroplasmy variance (Wonnapinij et al., 2010) for the observed progeny cells to the null distribution generated from the simulations.
For a cell population with heteroplasmies \((h = h_1, h_2, \ldots, h_n)\), the normalized variance is defined as:
\[V'(h) = var(h) / \bar{h}(1 - \bar{h})\]
where \(\bar{h}\) is the mean heteroplasmy of the population and heteroplasmies are expressed as proportions (0-1).
To test whether the observed variance is consistent with random segregation, we:
Calculated the mean normalized variance across lineages from observed data
Generated a null distribution of mean normalized variances from 10,000 simulations under random hypergeometric partitioning
Performed a two-sided Monte Carlo test comparing the observed statistic to the null distribution
The p-value was calculated as the proportion of simulated values that deviated from the null distribution mean by at least as much as the observed value:
where \(S_{obs}\) is the observed mean normalized variance, \(S_i\) are the simulated values, \(\bar{S}\) is the mean of the null distribution, and \(n = 10{,}000\). This two-tailed test assesses whether the observed variance is unusually high or low compared to random segregation.
Monte Carlo test for normalized heteroplasmy variance
# Function to calculate normalized heteroplasmy variancecompute_normalized_variance <-function(heteroplasmy_values) {# Ensure heteroplasmy is on a 0-1 scale (proportion)if (max(heteroplasmy_values, na.rm =TRUE) >1) { heteroplasmy_values <- heteroplasmy_values /100 } mean_h <-mean(heteroplasmy_values, na.rm =TRUE) var_h <-var(heteroplasmy_values, na.rm =TRUE)# If variance is not defined, or if mean is 0 or 1, normalized variance is 0.if (is.na(var_h) || mean_h ==0|| mean_h ==1) {return(0) } normalized_variance <- var_h / (mean_h * (1- mean_h))return(normalized_variance)}# Compute observed variances for NSCs and progeny cellsobserved_statistic_variance <- mtDNA_number_data %>% dplyr::filter(cell_type =="Progeny") %>%group_by(lineage) %>%summarize(norm_var_lineage =compute_normalized_variance(heteroplasmy),.groups ="drop" ) %>%summarize(mean_norm_var =mean(norm_var_lineage, na.rm =TRUE) ) %>% dplyr::pull(mean_norm_var)# Unnest the simulated results listsimulated_het_df <- purrr::imap_dfr( progeny_sims,~ purrr::map_dfr(.x, ~.x, .id ="simulation_id"),.id ="lineage")# Null distribution# For each of the n_sims simulation IDs,# we calculate the mean normalized variance across all lineagesnull_distribution_variance <- simulated_het_df %>%group_by(simulation_id, lineage) %>%summarize(norm_var_lineage =compute_normalized_variance(heteroplasmy),.groups ="drop_last"# Drops 'lineage', keeps 'simulation_id' ) %>%summarize(mean_norm_var =mean(norm_var_lineage, na.rm =TRUE) ) %>% dplyr::pull(mean_norm_var)# Monte Carlo two-sided p-valuenull_mean_variance <-mean(null_distribution_variance, na.rm =TRUE)p_value_variance <- (sum(abs(null_distribution_variance - null_mean_variance) >=abs(observed_statistic_variance - null_mean_variance)) +1) / (n_sims +1)# Standard Error on the mean for null distributionse_null_variance <-sd(null_distribution_variance, na.rm =TRUE) /sqrt(length(null_distribution_variance))
Figure 10: Monte Carlo test for normalized heteroplasmy variance in progeny cells. Null distribution generated from 10,000 simulations in red and the observed statistic indicated by the dashed line. Two-tailed Monte Carlo p-value shown.
The p-value for a two-sided Monte Carlo test comparing the observed and simulated normalized heteroplasmy variances was 0.1193, suggesting that the observed variance in progeny cells is not significantly different from that expected under a model of random mtDNA segregation.
The standard error on the mean of the null distribution is 1.33e-04, suggesting that the estimate of the mean normalized variance from the simulations is precise.
Density plots of normalized heteroplasmy variance
Kernel density estimation was performed using the Gaussian kernel with bandwidth determined by Silverman’s rule of thumb (Silverman, 1986), calculated as:
where \(sd\) is the standard deviation, \(IQR\) is the interquartile range, and \(n\) is the sample size.
Histogram binwidths were determined using the Freedman-Diaconis rule: \[w = 2 \times \frac{IQR}{n^{1/3}}\]
Figure 11: Density plots of normalized heteroplasmy variance for observed and simulated progeny cells. Kernel density estimation performed using Gaussian kernel with bandwidth determined by Silverman's rule of thumb. Histogram binwidths were determined using the Freedman-Diaconis rule.
Density plots of normalized heteroplasmy variance per lineage
Figure 12: Null distributions from 10,000 random segregation simulations (red density plots) for each lineage. The corresponding observed normalized variance is marked by the dashed blue line. Kernel density estimation performed using Gaussian kernel with bandwidth determined by Silverman's rule of thumb.
Monte-Carlo test of normalized heteroplasmy shift
We then compared the normalized heteroplasmy shift (Johnston and Jones, 2016) for the observed progeny cells to the null distribution generated from the simulations. The normalized heteroplasmy shift is calculated as:
where \(h_{NSC}\) and \(h_{Progeny}\) represent heteroplasmy as proportions (0-1).
To test whether observed shifts deviate from random segregation:
We calculated the mean heteroplasmy shift per lineage (averaged across ~6 progeny cells per lineage), then took the mean across all 14 lineages as our test statistic
We generated a null distribution from 10,000 simulations of random hypergeometric partitioning, calculating the same statistic for each simulated dataset
We performed a two-sided Monte Carlo test comparing the observed mean shift to the null distribution
The p-value was calculated as the proportion of simulated values that deviated from the null distribution mean by at least as much as the observed value:
where \(S_{obs}\) is the observed mean heteroplasmy shift, \(S_i\) are the simulated values, \(\bar{S}\) is the mean of the null distribution, and \(n = 10{,}000\). This two-tailed test assesses whether the observed shift is unusually large in either direction compared to random segregation.
Monte Carlo test for normalized heteroplasmy shift
# Function to calculate heteroplasmy shift from NSC to progeny cellscompute_heteroplasmy_shift <-function(nsc_heteroplasmy, progeny_heteroplasmy, nsc_cn, progeny_cn) { p_nsc <- nsc_heteroplasmy /100 p_progeny <- progeny_heteroplasmy /100# Squeeze values to avoid -Inf/Inf at 0 and 1# Calculate epsilon based on half a molecule in the largest pool epsilon <-1/ (2*max(c(nsc_cn, progeny_cn))) p_nsc <-pmax(epsilon, pmin(1- epsilon, p_nsc)) p_progeny <-pmax(epsilon, pmin(1- epsilon, p_progeny)) heteroplasmy_shift <-log( (p_progeny * (1- p_nsc)) / ((1- p_progeny) * p_nsc) )return(heteroplasmy_shift)}observed_nsc_data <- mtDNA_number_data %>%filter(cell_type =="NSC") %>%select( lineage,heteroplasmy_nsc = heteroplasmy,cn_nsc = mtDNA_total)# Calculate the observed heteroplasmy shiftsobserved_statistic_shift <- mtDNA_number_data %>%filter(cell_type =="Progeny") %>%left_join(observed_nsc_data, by ="lineage") %>%mutate(het_shift =compute_heteroplasmy_shift(nsc_heteroplasmy = heteroplasmy_nsc,progeny_heteroplasmy = heteroplasmy,nsc_cn = cn_nsc,progeny_cn = mtDNA_total) ) %>%group_by(lineage) %>%summarize(mean_het_shift_lineage =mean(het_shift, na.rm =TRUE),.groups ="drop" ) %>%summarize(mean_het_shift =mean(mean_het_shift_lineage, na.rm =TRUE) ) %>%pull(mean_het_shift)simulated_het_shifts_df <- simulated_het_df %>% dplyr::mutate(lineage =as.numeric(lineage)) %>% dplyr::left_join(observed_nsc_data, by ="lineage") %>% dplyr::mutate(het_shift =compute_heteroplasmy_shift(nsc_heteroplasmy = heteroplasmy_nsc,progeny_heteroplasmy = heteroplasmy,nsc_cn = cn_nsc,progeny_cn = mtDNA_total) )# Null distribution at the lineage levelnull_distribution_shift <- simulated_het_shifts_df %>%group_by(simulation_id, lineage) %>%summarize(mean_het_shift_lineage =mean(het_shift, na.rm =TRUE),.groups ="drop_last" ) %>%summarize(mean_het_shift =mean(mean_het_shift_lineage, na.rm =TRUE),.groups ="drop" ) %>%pull(mean_het_shift)# Monte Carlo two-sided p-valuenull_mean_shift <-mean(null_distribution_shift, na.rm =TRUE)p_value_shift <- (sum(abs(null_distribution_shift - null_mean_shift) >=abs(observed_statistic_shift - null_mean_shift)) +1) / (n_sims +1)# Standard Error on the mean for null distributionse_null_shift <-sd(null_distribution_shift, na.rm =TRUE) /sqrt(length(null_distribution_shift))
Figure 13: Monte Carlo test for normalized heteroplasmy shift in progeny cells. Null distribution generated from 10,000 simulations in red and the observed statistic indicated by the dashed line. Two-tailed Monte Carlo p-value shown.
The p-value for a two-sided Monte Carlo test comparing the observed and simulated mean heteroplasmy shifts was 1e-04, suggesting that the observed shift in progeny cells is significantly different from that expected under a model of random mtDNA segregation.
The standard error on the mean of the null distribution is 9.25e-04, suggesting that the estimate of the mean heteroplasmy shift from the simulations is precise.
Density plots of heteroplasmy shifts
Kernel density estimation was performed using the Gaussian kernel with bandwidth determined by Silverman’s rule of thumb (Silverman, 1986), calculated as:
where \(sd\) is the standard deviation, \(IQR\) is the interquartile range, and \(n\) is the sample size.
Histogram binwidths were determined using the Freedman-Diaconis rule:
\[w = 2 \times \frac{IQR}{n^{1/3}}\]
Figure 14: Density plots of heteroplasmy shifts for observed and simulated progeny cells. Kernel density estimation performed using Gaussian kernel with bandwidth determined by Silverman's rule of thumb. Histogram binwidths were determined using Freedman-Diaconis rule.
Density plots of heteroplasmy shifts per lineage
Figure 15: Density plots of the observed progeny heteroplasmy (light blue) overlaid on the null distribution from 10,000 random segregation simulations (red) for each lineage. The heteroplasmy of the parent NSC for each lineage is marked by the dashed line. Kernel density estimation performed using Gaussian kernel with bandwidth = 2.5 times Silverman's rule of thumb, scaled to increase smoothing.
Approximate Bayesian Computation (ABC) analysis
To formally test whether selection is acting during neurogenesis, we performed an Approximate Bayesian Computation (ABC) analysis (Beaumont et al., 2002), a simulation-inference framework where likelihoods are approximated by comparing observed and simulated summary statistics.
The ABC workflow involved:
Summary statistics selection: Identifying the most informative features of the data
Model selection to determine whether selection or pure drift better explains the observed patterns
Parameter estimation to quantify the selection coefficient (s) against mutant mtDNA
Simulation Models
We compared two alternative models of mtDNA inheritance:
Model 1: Pure Bottleneck (s = 0): mtDNA molecules are randomly partitioned to progeny cells through hypergeometric sampling without replacement. Each progeny sequentially draws mtDNA molecules from the NSC pool, which updates after each draw.
Model 2: Selection + Bottleneck (s ≠ 0): Selection modifies mtDNA frequencies before partitioning If \(p\) is the initial mutant mtDNA proportion in the NSC, the post-selection proportion \(p'\) is given by:
\[p' = \frac{p(1+s)}{p(1+s) + (1-p)}\]
where \(s\) is the selection coefficient. Negative \(s\) values indicate purifying selection against mutant mtDNA. After selection, segregation proceeds as in Model 1.
Both models used observed NSC counts and progeny copy numbers for each lineage, ensuring simulations matched the actual bottleneck severity and starting heteroplasmy.
Model definitions for ABC analysis
# Model 1: Random segregation; no selection (s = 0)# Same algorithm as the simulate_lineage_hypergeometric() functionsimulate_pure_bottleneck <-function(nsc_wt, nsc_mut, observed_totals) {# Update mtDNA pool after each draw remaining_wt <- nsc_wt remaining_mut <- nsc_mut mut_samples <-numeric(length(observed_totals))for (i inseq_along(observed_totals)) { current_pool <- remaining_wt + remaining_mutif (current_pool ==0) { mut_samples[i] <-0next } k_draw <-min(observed_totals[i], current_pool) drawn_wt <-rhyper(1,m = remaining_wt,n = remaining_mut,k = k_draw ) mut_samples[i] <- k_draw - drawn_wt remaining_wt <- remaining_wt - drawn_wt remaining_mut <- remaining_mut - mut_samples[i] }# Handle cases where observed_totals is 0 to avoid NaN het_values <- (mut_samples / observed_totals) *100 het_values[is.nan(het_values)] <-0return(het_values)}# Model 2: Selection before division (s != 0)simulate_selection <-function(nsc_wt, nsc_mut, observed_totals, s) { total_pool <- nsc_wt + nsc_mut het_nsc <- nsc_mut / total_pool w_mut <-1+ s freq_mut_eff <- (het_nsc * w_mut) / (het_nsc * w_mut + (1- het_nsc)) eff_mut <-round(freq_mut_eff * total_pool) eff_mut <-max(0, min(eff_mut, total_pool)) eff_wt <- total_pool - eff_mutreturn(simulate_pure_bottleneck(nsc_wt = eff_wt,nsc_mut = eff_mut,observed_totals = observed_totals ))}
Summary statistics selection for ABC
We evaluated 14 candidate summary statistics using Random Forest variable importance measures (Raynal et al., 2019). For each lineage, we simulated 5,000 datasets with \(s \sim \mathcal{U}(-1, 0.1)\), calculated all summary statistics, and trained a regression ABC-RF model to predict \(s\) from the statistics and extracted variable importance scores.
Summary statistics calculation and selection for ABC analysis
# Calculate all summary statistics for initial testingcalc_summary_stats <-function(progeny_het, nsc_het, nsc_cn, progeny_cn, stats_list) {# Return NA values if data is insufficient for calculationsif (all(is.na(progeny_het)) ||length(progeny_het) <2) {return(setNames(rep(NA, length(stats_list)), stats_list)) }# Calculate heteroplasmy shifts using the robust, data-driven epsilon method het_shifts <-compute_heteroplasmy_shift(rep(nsc_het, length(progeny_het)), progeny_het, rep(nsc_cn, length(progeny_het)), progeny_cn)# Squeeze values for skewness/kurtosis to avoid issues at the boundaries progeny_het_squeezed <-pmax(0.001, pmin(99.999, progeny_het))# Initialize an empty list to store results results <-list()# Dynamically calculate only the requested statisticsfor (stat in stats_list) { results[[stat]] <-switch( stat,"mean_het"=mean(progeny_het, na.rm =TRUE),"var_het"=var(progeny_het, na.rm =TRUE),"median_het"=median(progeny_het, na.rm =TRUE),"mean_shift"=mean(het_shifts, na.rm =TRUE),"var_shift"=var(het_shifts, na.rm =TRUE),"min_het"=min(progeny_het, na.rm =TRUE),"max_het"=max(progeny_het, na.rm =TRUE),"q25_het"= { q <-quantile(progeny_het, 0.25, na.rm =TRUE)names(q) <-NULL# Remove the "25%" name q },"q75_het"= { q <-quantile(progeny_het, 0.75, na.rm =TRUE)names(q) <-NULL# Remove the "75%" name q },"skew_het"= moments::skewness(progeny_het_squeezed, na.rm =TRUE),"kurt_het"= moments::kurtosis(progeny_het_squeezed, na.rm =TRUE),"mad_het"=mad(progeny_het, na.rm =TRUE),"prop_below_nsc"=mean(progeny_het < nsc_het, na.rm =TRUE),"norm_var"=compute_normalized_variance(progeny_het),NA# Default case if the stat is not recognized ) }# Return the results as a named vectorreturn(unlist(results))}select_summary_stats <-function( mtDNA_data, stats_list =c("mean_het", "var_het", "median_het", "mean_shift", "var_shift", "min_het", "max_het", "q25_het", "q75_het", "skew_het", "kurt_het", "mad_het", "prop_below_nsc", "norm_var")) {# Number of simulations for testing n_rf_sims <-5000 all_lineage_var_imp_df <- purrr::map_dfr(unique(mtDNA_data$lineage), ~{ lineage_id <- .x# Filter data for current lineage lineage_data <- dplyr::filter(mtDNA_data, lineage == lineage_id) nsc_data <- dplyr::filter(lineage_data, cell_type =="NSC") progeny_data <- dplyr::filter(lineage_data, cell_type =="Progeny")# Generate the reference table for this lineage# Prior distribution for s prior_s_rf <-runif(n_rf_sims, min =-1, max =0.1) summary_stats_matrix <-t(sapply(prior_s_rf, function(s) {# Run simulations under selection model sim_het <-simulate_selection(nsc_wt = nsc_data$mtDNA_wt,nsc_mut = nsc_data$mtDNA_mut,observed_totals = progeny_data$mtDNA_total,s = s )# Calculate summary statistics for the simulated datacalc_summary_stats(progeny_het = sim_het,nsc_het = nsc_data$heteroplasmy,nsc_cn = nsc_data$mtDNA_total,progeny_cn = progeny_data$mtDNA_total,stats_list = stats_list ) })) ref_table <-data.frame(s = prior_s_rf,as.data.frame(summary_stats_matrix) )# Filter out NA values ref_table <- ref_table %>% dplyr::filter(complete.cases(.))# Train the regABCRF model abc_rf_model <-regAbcrf( s ~ .,data = ref_table,ntree =500,paral =TRUE )# Extract variable importance (data.frame(lineage = lineage_id,Statistic =names(abc_rf_model$model.rf$variable.importance),Importance = abc_rf_model$model.rf$variable.importance )) })return(all_lineage_var_imp_df)}var_importance_res <-select_summary_stats(mtDNA_number_data)selected_stats <- var_importance_res %>% dplyr::group_by(Statistic) %>%summarize(MedianImportance =median(Importance, na.rm =TRUE)) %>%arrange(desc(MedianImportance)) %>%slice_head(n =5) %>% dplyr::pull(Statistic) %>%as.character()
Figure 16: Variable importance of 14 candidate summary statistics for ABC parameter estimation. Box plots show Random Forest variable importance scores across all 14 lineages, ranked by median importance. The top 5 statistics (mean heteroplasmy, normalized heteroplasmy shift, median heteroplasmy, and 25th and 75th percentiles of the heteroplasmy distribution) were selected for downstream ABC analysis based on their discriminatory power for estimating selection coefficients.
Based on the variable importance results, we selected the top five summary statistics for ABC model selection and parameter estimation:
Top 5 summary statistics selected for ABC inference
Rank
Summary Statistic
Description
Median Importance
1
Mean Heteroplasmy
Mean heteroplasmy across progeny
142.5
2
Normalized Heteroplasmy Shift
Mean change in heteroplasmy from NSC to progeny
103.5
3
Median Heteroplasmy
Median heteroplasmy across progeny
81.1
4
75th Percentile
75th percentile of progeny heteroplasmy
55.5
5
25th Percentile
25th percentile of progeny heteroplasmy
47.4
ABC Model Selection using ABC-RF
We used ABC Random Forest (ABC-RF) classification (Pudlo et al., 2016) to compare the two models (pure drift vs. selection + drift). For each lineage, we simulated 10,000 datasets (5,000 per model), calculated the five selected summary statistics, and trained an ABC-RF classifier to predict the model from the statistics. We then predicted the best model for the observed data and calculated posterior probabilities, Bayes factors (\(\text{BF} = P(\text{selection})/P(\text{drift})\))
Figure 17: ABC-RF classification performance across lineages. Black line and points show out-of-bag (OOB) error rates for each lineage-specific random forest classifier. Red and blue points indicate model-specific misclassification rates for drift and selection models, respectively.
The ABC-RF classifiers demonstrated discriminatory power between drift and selection models, achieving a mean out-of-bag (OOB) error of 19.6% (SD: 4.1%, range: 11.9%-25.5%) across all lineages. Class-specific misclassification rates were asymmetric, with 12.2% for drift versus 27.1% for selection. The classifier more often misclassifies selection as drift than vice versa, minimizing false positives.
Figure 18: Posterior probabilities for selection vs. drift models across all lineages. Stacked bars show the relative support for selection + bottleneck (blue) versus pure drift (red) models. All 14 lineages favor the selection model, with 13/14 showing P(Selection) > 0.9, indicating strong evidence for selection against mutant mtDNA during NSC division.
All lineages showed stronger support for the selection + bottleneck model compared to pure drift, with posterior probabilities for selection ranging from 0.713 to 0.999, and 13 out of 14 lineages exceeding 0.9. This result is particularly robust given the classifier’s conservative tendency to misclassify selection as drift, indicating that the selection signal in the data is sufficiently strong to overcome this bias.
ABC Parameter Estimation using SMC-ABC
We estimated selection coefficients using Sequential Monte Carlo ABC (Lenormand et al., 2013), an adaptive algorithm that iteratively refines posterior distributions through sequential resampling. We specified a uniform prior, \(s \sim \mathcal{U}(-1, 0.1)\), for the selection coefficient. This prior range allows for strong purifying selection (\(s = -1\)) to weak positive selection (\(s = +0.1\)), with the asymmetry reflecting the observed reduction in heteroplasmy from NSCs to progeny cells. We used 1,500 particles per lineage with acceptance threshold declining adaptively from 50% to 2%. For each proposed \(s\), we simulated progeny heteroplasmy using observed NSC mtDNA counts and calculated the five selected summary statistics.
SMC-ABC parameter estimation for selection coefficient (s)
Figure 19: Selection coefficient estimates for all lineages from Sequential Monte Carlo ABC. Points show posterior median estimates with 95% credible intervals (horizontal bars). All lineages show negative selection coefficients (s < 0), indicating purifying selection against mutant mtDNA. The red dashed line at s = 0 represents neutral evolution.
Figure 20: Posterior distributions of selection coefficients (s) for all lineages. Ridge plots show the full posterior density from SMC-ABC, colored by log₁₀(Bayes Factor) from model selection analysis. Darker colors indicate stronger evidence for selection over drift. All distributions are centered below s = 0 (red dashed line), confirming purifying selection against mutant mtDNA.
Summary statistics for estimated selection coefficients across all lineages
Number of lineages analyzed
14
Lineages with significant selection
14 (100.0%)
Median selection coefficient
-0.510
Mean selection coefficient (± SD)
-0.502 ± 0.099
Range
-0.637 to -0.320
Interquartile range (Q25 - Q75)
-0.565 to -0.438
The median selection coefficient across all lineages was -0.51, with 14 out of 14 lineages showing significant evidence for negative selection (95% credible interval excluding zero).
Figure 21: Parameter recovery validation for SMC-ABC. Points show estimated selection coefficients with 95% credible intervals (error bars) versus true values used in simulations. The dashed diagonal line indicates perfect recovery.
Parameter recovery analysis across 20 simulated datasets demonstrated that SMC-ABC reliably estimates selection coefficients. The method achieved 95% coverage (close to the nominal 95%), with mean absolute error of 0.063 and RMSE of 0.085. Estimates showed no systematic bias (mean bias: 0.034), with points distributed evenly around the line of perfect recovery. The test range (s = -0.8 to -0.2) encompasses all posterior estimates observed across lineages (mean: -0.502 ± 0.099, range: -0.637 to -0.320) with conservative buffers extending 25-37% beyond the most extreme estimates. Parameter uncertainty increased slightly at stronger selection strengths (s < -0.6), reflecting the greater challenge of precisely estimating extreme parameter values. These metrics validate the reliability of our parameter estimation procedure for inferring selection coefficients from experimental data with similar characteristics to our observations.
Statistical tests were two-sided unless otherwise stated. Monte Carlo p-values were calculated using the formula \((k + 1) / (n + 1)\), where \(k\) is the number of simulated values as extreme as the observed, and \(n\) is the number of simulations.
References
Beaumont, M.A., Zhang, W., Balding, D.J., 2002. Approximate bayesian computation in population genetics. Genetics 162, 2025–2035. https://doi.org/10.1093/genetics/162.4.2025
Johnston, I.G., Jones, N.S., 2016. Evolution of cell-to-cell variability in stochastic, controlled, heteroplasmic mtDNA populations. The American Journal of Human Genetics 99, 1150–1162. https://doi.org/10.1016/j.ajhg.2016.09.016
Lenormand, M., Jabot, F., Deffuant, G., 2013. Adaptive approximate bayesian computation for complex models. Computational Statistics 28, 2777–2796. https://doi.org/10.1007/s00180-013-0428-3
Pudlo, P., Marin, J.-M., Estoup, A., Cornuet, J.-M., Gautier, M., Robert, C.P., 2016. Reliable ABC model choice via random forests. Bioinformatics 32, 859–866. https://doi.org/10.1093/bioinformatics/btv684
Raynal, L., Marin, J.-M., Pudlo, P., Ribatet, M., Robert, C.P., Estoup, A., 2019. ABC random forests for bayesian parameter inference. Bioinformatics 35, 1720–1728. https://doi.org/10.1093/bioinformatics/bty867
Silverman, B.W., 1986. Density estimation for statistics and data analysis, Monographs on statistics and applied probability. CRC press, London. https://doi.org/10.1201/9781315140919
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L.D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T.L., Miller, E., Bache, S.M., Müller, K., Ooms, J., Robinson, D., Seidel, D.P., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., Yutani, H., 2019. Welcome to the tidyverse. Journal of Open Source Software 4, 1686. https://doi.org/10.21105/joss.01686
Wonnapinij, P., Chinnery, P.F., Samuels, D.C., 2010. Previous estimates of mitochondrial DNA mutation level variance did not account for sampling error: Comparing the mtDNA genetic bottleneck in mice and humans. American Journal of Human Genetics 86, 540–550. https://doi.org/10.1016/j.ajhg.2010.02.023